Exploratory Analysis
Univariate analysis
First, I want to review basic information and the descriptive statistics of the data set to get a better handle on the data set, and see if anything stands out.
Here are the dimensions of the data set:
## [1] 113937 81
I will now setup the new data frame to only include the variables listed above.
## [1] "ListingCategory..numeric."
## 'data.frame': 113937 obs. of 19 variables:
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
Although not necessarily pertinent to our analysis, I am curious about the distribution for number of investors per loan, therefore I will start off by plottig a histogram.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00

The number of investors per loan seem to be heavily positive skewed. I will try performing a log transformation.

After performing the transformation, we can see that the distribution is more normal and restores the symmetry, aside from the spike at zero. What is interesting is that most loans - not taking into consideration the first spike - seem to be funded by multiple investors, with a peak at around 80 investors.
Next, I will evaluate the distribution for the original amount of the loans.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000

The amount of the loans range from $1,000 to $35,000 with a median of $6,500. There are some peaks in the distribution that require further analysis.
Zooming in, I see that these peaks represent multiples of $500 and $1000. There are peaks in the distribution at $4,000, $10,000, $15,000, $20,000 and $25,000. Because of the skewed distribution I decided to use a log scale for this variable, leading to a distribution that is closer to a normal distribution.
An obvious factor that can influence if someone is able to repay their loan is their annual income.
Most borrowers appear to have annual incomes ranging from $25,000 to $75,000. However, this in itself doesn’t indicate too much, as someone with high income could also have a large amount of pre-existing debt. Therefore, I will look at the distribution of the Debt-to-Income ratio. This variable shows the ratio at the time a credit check was performed for the borrower.

Most borrowers have a ratio under 50%. For context, to be elible for a Qualified Mortgage, the debt-to-income ratio should typically be less than 43%.
Lastly, an important factor to look at is the borrower’s credit score. We have been provided with both the upper and the lower end range of a borrower’s credit score, therefore we will look at both.


## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 699.0 704.6 739.0 899.0 591
Not surprisingly, the plots for both the upper and normal ranges resemble each other. This is likely due to the fact that the range intervals provided are the same across all borrowers. There are some outliers with a score near 0, which mean that the borrower has no credit history. Note that credit scores usually are in the range of 300 to 850. Based on the distribution, it appears as though to get a loan, a borrower requires a score of at least 620, which is confirmed on Prosper’s website. The range for the lower end of the credit score is from 0 to 880, and 19 to 899 for the upper end.
To focus on credit scores typically found, I will limit the x-axis from 300 to 850.


With the limit in place, it is easier to see that the typical range is actually between 670-720, which is considered a good score.
Univariate Analysis using Values Grouped by Occupations For Delinquency Evaluation
There are a few categorical variables within the data set. At first glance, BorrowerState has 52 states or levels and Occupation has 68 levels/categories for over 113k observations - it should be possible to nicely group by occupations to see any trends.
Example of categorical values in “LoanStatus”:
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (>120 days)" "Past Due (1-15 days)"
## [9] "Past Due (16-30 days)" "Past Due (31-60 days)"
## [11] "Past Due (61-90 days)" "Past Due (91-120 days)"
There are a few challenges here. If I’m investigating delinquencies, I need to define a definition for it, and since the Prosper data has a few categories that could cover this, I need to do two things 1) learn the difference between Chargedoff and Defaulted and 2) assign a cutoff line for delinquents, i.e if someone is 1-15 days late on payment is that delinquent? What about 61-90 days late?
Chargedoff:
A charge-off or chargeoff is the declaration by a creditor (usually a credit card account) that an amount of debt is unlikely to be collected. This occurs when a consumer becomes severely delinquent on a debt. Traditionally, creditors will make this declaration at the point of six months without payment. In the United States, Federal regulations require creditors to charge-off installment loans after 120 days of delinquency, while revolving credit accounts must be charged-off after 180 days
– Wikipedia
Defaulted:
In finance, default is failure to meet the legal obligations (or conditions) of a loan,[1] for example when a home buyer fails to make a mortgage payment, or when a corporation or government fails to pay a bond which has reached maturity.
– Wikipedia
Define Delinquent Borrowers
# New variable to be used to identify "delinquent" borrowers
prosperloans$DelinquentBorrowers <- ifelse(
prosperloans$LoanStatus == "Defaulted" |
prosperloans$LoanStatus == "Chargedoff" |
prosperloans$LoanStatus == "Past Due (61-90 days)" |
prosperloans$LoanStatus == "Past Due (91-120 days)" |
prosperloans$LoanStatus == "Past Due (>120 days)",
1, 0)
The first thing I wanted to do was look at the creation date distribution of observations in the data set. This is important as this could cause bias in further investigation, for example, if the majority of the data was observed during 2008-2009 (the financial crisis) this could skew the data towards having a majority of delinquency. To analyze this, I’m going to explore the ListingCreationDate variable along with LoanStatus.
Loan Creation Year Distribution

In order to help facilitate my analysis, I’ve decided to create a new variable that will capture the year segment of the loan creation date.
From first glance, there is an obvious dip in borrowers, both delinquent and otherwise across 2008 and 2009. My first thoughts were connecting this fall in borrowers to the financial crisis which occurred in late 2008. However, I then realized that we were looking at the “creation” date of the loan - meaning the transaction date of the loan - not the date that the loan was considered delinquent. Therefore, loans that are considered delinquent have gone delinquent at some point during the life cycle of the loan.
I believe that the distribution of delinquent vs non-delinquent accounts can be at least partly attributed to the financial crisis though. Borrowers took out loans at a “normal” rate from 2006-2008 but potentially after the financial crisis in 2008, borrowers could no longer cover their monthly payments and loans created 2006-2008 were the most susceptible to defaults. This assumption also explains the dip in 2009 and gradual increase over the next few subsequent years, less borrowers and much fewer delinquents in comparison to the level of loans in good standing.
However, after finalizing the plot and reviewing both good standing and delinquent borrowers, it seems that during these years there was no overwhelming number of delinquent borrowers. It appears total borrowers (and/or investors) lowered during this time which may or may not be due to the financial crisis.
Time to further the analysis and investigate some other interesting features. The next variables to explore are: Term, LoanStatus, LoanOriginalAmount, Income Range, Borrower’s State and Listing Category (the reason for a loan).
Term and Loan Status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 40.83 36.00 60.00

From the summary breakdown, the average term appears to be 36 months (3 yrs) with an original loan request being $8337, ranging from $1000 to $35,000. This seems to cover the full available range allowed by Prosper, although at the time of writting this analysis, Prosper now has a minimum loan limit of $2000.
It’s also very clear that the majority of borrowers choose a 36 month term, where 60 and then 12 month terms are 2nd and 3rd most popular, respectively. And approximately 15% of the total borrowers are delinquent on their loans. Thinking of the distribution of delinquency across terms, one would imagine larger term equates to a larger loan and monthly payment amounts. This in turn would contribute to having more delinquencies. However, the distribution of delinquency is surprisingly very similar to that of the total loan distribution across loan terms. An investigation around loan amounts and across terms may reveal more, i.e if loans with larger amounts are associated to the borrowers with better credit scores, it’s more likely that these loans will be paid off and borrowers will be in good standing.
Original Loan Amount
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000

Summary data indicates that the mean and median are quite far apart. This is visualized in the graph - which is right skewed. The spikes of borrowers at $10,000, $15,000 and $25,000 are pulling the mean higher. This isn’t actually all that surprising as most people will gravitate towards round numbers.

Taking a closer look at 95% of the data - specifically the data for loans under $20,000 - you can see multiple spikes at most of the round numbers - such as 1,000, 2,000,…,8000, 9000 and 10,000 dollars. Since most debt doesn’t come in nice round numbers, we can assume borrowers are probably rounding up instead to requesting exact dollar values for their loans. For example, if a borrower is $4,678.86 dollars in debt, she is probably requesting a loan of $5,000.00. From the data provided, I don’t believe there is in specific way to determine if this assumption is true, however, given that debt / money is a floating value with interest rate applied as floating value, it is quite unlikely that the majority of borrowers have debt divisible by 5.
Income Range

Reviewing income range allows us to see the distribution of total borrowers and their income with a $25k dollar bin / range. This could shine light into the financial standing of borrowers and could be a contributing factor to delinquency.
The distribution of borrowers across income range is somewhat normal if we consider that there are a number of users in the “Not Displayed” category. It was quite surprising to see so many borrowers with stated incomes of over $100,000.
Borrowers State


Borrowers state covers 52 states including Alaska and Hawaii. Representation for California is skewing the graph, making it difficult to distinguish smaller state counts and differences among states. Given California is the most populous state in the US, having a much higher count in borrowers is probably not a good indicator of a trend. To fix this and to better view any potential trends, a log10 transformation was applied to the y axis.
California has the highest count of delinquents but it should be taken into consideration that this state also has the highest population count, with Florida, Texas and the East coast following suite. Noticeably, “middle America” has the lowest levels of delinquent borrowers but this could also be due to lower levels in population and cost of living, which could result in people having a better financial situation not requiring loans, etc or a host of other reasons outside of the information the data set provides.
Two other unexpected results from the plot shows 1) there are no states with more delinquent borrowers than those in good standing and 2) the states of ND and WY have negligible numbers of delinquent borrowers.
Listing Category


The listing category values show the reason for the loan. For example, if the value is “Auto”, it’s likely that the borrower is using the funds to finance a car.
Based on the above plot, although “Debt Consolidation” is by far the most common reason borrowers require a loan, this is misleading. Debt consolidation removes the specificity of where the debt came from. Same categories such as Auto, Student Use or Taxes could have contributed to someone’s debt, but the “reason” the borrower is requesting a loan is for consolidation. Therefore, when presented with all options, they may flag “Debt Consolidation” as the purpose - which ends up serving as a catch-all. One thing we can take away from the high volume of borrowers requesting debt consolidation loans is that many borrowers have debt from multiple sources.
Bivariate analysis
Continuing to look for trends in delinquencies, I will investigate possible expected and unexpected relationships between features with tools such as the scatter plot and box plot.
Listing Cateogry and Loan Amount


In the above two plots, I examine the loan category, loan amounts and borrowing rate. Debt consolidation as expected has the highest loan amounts across categories; however, unexpectedly “Baby and Adoption” category is tied with debt consolidation. It would be interesting to investigate the total number of borrowers across these two categories, one would expect debt consolidation to have a large distribution of borrowers.
Also of note, the “Student Use” category loan amount it quite low. Given the well documented high levels of student debt, we can assume students didn’t suddenly stop having debt but instead don’t use Prosper to manage their debt.
The borrower’s interest rate doesn’t seem to have any particular surprises. The mean borrowing rate across all categories tend to vary between 0.15 and 0.25.
Loan Term and Loan Amount

In this plot, I examine the three (3) term loans and borrower’s loan amounts. Surprisingly the 60 month term has a mean value much higher than that of the 36 month term loan. This was surprising considering our previous plot where only approx. 25k borrowers used 60 month term loans in comparison to approx. 75k borrowers who used 36 month terms. This could be a point of further investigation particularly around delinquent accounts i.e. what percentage of 60 month terms go into delinquency in comparison to the other 2 term loan periods?
Loan and Bankcard Utilization

This plot was quite disappointingly surprising as I was expecting a different/stronger relationship where loan amounts would rise as bankcard utilization lowered. Instead, we see a high volume cluster of loans under $10,000 with borrowers close to 100% bankcard utilization. I may revisit both of these variables later in the analysis but I would consider this correlation/trend/relationship discovery between these two variables a dead end.
Debt-to-Income Ratio

In comparing debt-to-income ratio with a borrower’s stated monthly income I was expecting to see a somewhat obvious, trend that delinquent borrowers would have a lower monthly income and a higher debt-to-income ratio. With over plotting in the scatter plot two additional techniques were used to more clearly reveal any trends or unexpected results.
Borrowers with stated incomes over $20k and debt-to-income ratios over 1 were considered as outliers and removed from the plot.
The density contour lines show a high concentration of delinquent borrowers earn less than $2500 a month but have a low debt-to-income ratio of under 0.50 (or 50%). The plot also suggests a negative correlation between monthly income and debt-to-income ratio, i.e the more a borrower makes in monthly income the lower their debt-to-income ratio; However, this does not guarantee the loan will not go into delinquency.
After further review from multiple online resources, it seems that a typical “good” debt-to-income ratio is under approx. 36%.
Interest Rate and Debt-to-Income Ratio 
The relationship between a borrower’s interest rate (BorrowerRate) and their debt ratio (DebtToIncomeRatio) was expected to be a positive correlation. I expected borrowers with high debt-to-income ratios would automatically receive higher interest rates on their loans. However, based on the above scatter plot, there seems to be no trend and a weak positive correlation between these two variables. Further investigation when plotting a correlation matrix will expose the details.
Multivariate Analysis
In this section, I’ll dive into examining the relationships between multiple variables of the data set. There are some expected “strong” relationships which will be investigated but also any unusual or unexpected relationships, strong or weak, will also be called out for a deep dive into the details.
First, a correlation matrix will be used to calculated the coefficients in order to help start the investigation process where by the variables with strong and/or weak relationships will be reviewed.
Correlation Matrix
# Pearson correlation coefficients, using pairwise observations (default method)
# Non-numeric columns automatically removed/ignored
ggcorr(prosperloans, label = TRUE, label_size = 3,
hjust = 0.8, size = 2.5, color = "black", layout.exp = 2)

The correlation matrix revealed a few surprising things - I thought there would be a much stronger relationship between interest rate (BorrowerRate) and the credit score (CreditScoreRangeUpper/Lower). At a score of -0.5 it’s the strongest correlation out of the selected variables. Also, the number of investors (Investors) and the borrower loan amount (LoanOriginalAmount) has a positive correlation of 0.4, which was somewhat expected as investors are only allowed to contribute a portion of a loan amount, i.e. no one investor can lend an entire loan amount.
Below, a few particularly interesting relationships will be investigated further.
Credit Score and Borrowing Rate

We see an expected result here, where the trend is lower credit scores tend to have higher borrowing rates and higher credit scores obtaining a lower borrowing rate. Delinquent borrowers also tend to have lower credit scores, this could be in part due to their higher interest rates on loans hindering payments and causing delinquencies. If lower credit scores mean higher borrowing rates, and higher rates tend to cause delinquency, how does anyone with a low credit score get out of the cycle? What’s the cause and effect? More investigation would be required to come to a conclusion. And also as expected, borrowers with higher credit scores and lower interest rate, tend to be in good standing with their loans.
Credit Score and Loan Amount
Summary: Delinquent Borrowers
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 619.0 679.0 661.5 719.0 879.0 174
Summary: Good Standing
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 719.0 712.4 739.0 899.0 417

In the above comparisons, both scatter plots show a borrower’s credit score against their loan amounts. Once again, this plot reveals some expected and unexpected results.
The higher concentration of delinquent borrowers have lower credit scores, and they also tend to borrow less money - mostly less than $10,000. This maybe due to lower credit scores often having higher interest rates on their loans. What was a bit unexpected was the lower loan amount for borrowers with much higher credit scores. Based on previous plots, higher scores provide lower rates, which would tend to allow the borrower to gain access to a higher loan amount. And although the average loan amount for loans in good standing are higher, there is still a high concentration of loans under $20,000 and with credit scores over 700. One other “weird” result which I’ve ignored so far has to do with the credit score values, for example, some borrowers have score of 899 where credit scores range only between 300 and 850.
Loan Amount and Investors

Based on the correlation matrix, a positive relationship was expected. However, the non-linear, somewhat exponential curve of investors to loan amount was unexpected. This plot also shows a wide variance from the centered curve line of the correlation. Of note, Proper doesn’t allow investors to contribute more than 10% of their net worth to any one loan. This could be the underlying factor contributing to the positive correlation between loan amount and number of investors.
Bankcard Utilization and Borrowing Rate

Revisiting bankcard utilization, this time with borrowing rate, it was unexpected to see bankcard utilization over 1 or 100% and how the typical calculation happens doesn’t seem to allow a utilization of over 100%. Another unexpected result, was the large number of borrowers with high bankcard utilization who were in good standing. Although there are a larger number of borrowers in good standing, in comparison to delinquent borrowers, good standing borrowers have a utilization rate which is more evenly spread through 0 to 100%.
Loan Amount and Year across Income Range
## [1] "Not employed" "$0" "$1-24,999" "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "$100,000+" "Not displayed"

These were quite interesting plots, as all income ranges display the same curve (distribution across 2007 - 2014). Again, this could be a result of loans, investors and the financial crisis as mentioned previously. Also, the category “not displayed” is missing data after 2008, and one can assume that possibly this category was removed as an option for users after 2008, forcing users to provide more transparency - which also could have been a side effect of the crisis. However, only in 2007 is there data for any of the other categories, which would imply that during 2005 and 2006, none of the income range categories existed (or the data was lost/not recorded).
Also of note, borrowers with $100,000+ consistently have a wider spread of loan amounts over the years. Borrowers within the categories of “$0” and “not employed” should probably be lumped together to provide more accurate information or at least, more manageable information - since by definition, you cannot be employed with $0 income but you can be unemployed and still have income (e.g. retirees).

On a deeper dive of the same plot, loan amount by year across income range, I investigate any possible trends around delinquency and spot some unexpected results. Particularly with borrowers whose income range is over $100,000. Delinquency for this income range tends to tread higher than good standing accounts, especially with loans created prior to 2008. These delinquent loans tend to have a higher min, max and mean dollar amount. This larger delinquency gap can also be seen with borrowers of the $0 income range. According to Wikipedia, Prosper relaunched in 2009 and as part of their relaunch, they implemented stricter credit guidelines for borrowers, which largely explains the difference.